Bivariate Analysis
Through bivariate analysis we try to analyze two variables simultaneously. As opposed to univariate analysis where we check the characteristics of a single variable, in bivariate analysis we try to determine if there is any relationship between two variables.
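We will come across essentially 3 major scenarios when we perform bivariate analysis, depending on the types of the two variables: numerical vs numerical, categorical vs numerical, and categorical vs categorical.
For the purpose of this exercise, we will explore a few of the most popular techniques for bivariate analysis.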
The following plots are not limited to the headings they are under. They are the options we have if we face a certain scenario.
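Which plots are worth trying depends on whether each of the two variables is numerical or categorical. The sketch below is one way to make that decision explicit; `bivariate_scenario` is a hypothetical helper and `toy` is a made-up frame that only borrows a few column names from the suicide-rates data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def bivariate_scenario(df: pd.DataFrame, x: str, y: str) -> str:
    """Classify a pair of columns by variable type (hypothetical helper)."""
    numeric_count = sum(is_numeric_dtype(df[col]) for col in (x, y))
    if numeric_count == 2:
        return "numerical vs numerical"      # e.g. scatter plot, regression plot
    if numeric_count == 1:
        return "categorical vs numerical"    # e.g. bar plot, box plot, violin plot
    return "categorical vs categorical"      # e.g. crosstab, grouped bar plot


# Tiny made-up frame; only the column names match the real data
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["15-24 years", "35-54 years", "75+ years"],
    "suicides_no": [21, 14, 1],
    "population": [312900, 289700, 21800],
})

print(bivariate_scenario(toy, "population", "suicides_no"))
print(bivariate_scenario(toy, "sex", "suicides_no"))
print(bivariate_scenario(toy, "sex", "age"))
```

The dtype check is only a first approximation: a column such as `year` is stored as an integer but is often better treated as a category.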
Our aim is to explore the data of suicide rates.
source: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
data = pd.read_csv('master.csv')
data.head()
| | country | year | sex | age | suicides_no | population | suicides/100k pop | country-year | HDI for year | gdp_for_year ($) | gdp_per_capita ($) | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NaN | 2,156,624,900 | 796 | Generation X |
| 1 | Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NaN | 2,156,624,900 | 796 | Silent |
| 2 | Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NaN | 2,156,624,900 | 796 | Generation X |
| 3 | Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NaN | 2,156,624,900 | 796 | G.I. Generation |
| 4 | Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NaN | 2,156,624,900 | 796 | Boomers |
# Summary statistics of the numeric columns
data.describe()
| | year | suicides_no | population | suicides/100k pop | HDI for year | gdp_per_capita ($) |
|---|---|---|---|---|---|---|
| count | 27820.000000 | 27820.000000 | 2.782000e+04 | 27820.000000 | 8364.000000 | 27820.000000 |
| mean | 2001.258375 | 242.574407 | 1.844794e+06 | 12.816097 | 0.776601 | 16866.464414 |
| std | 8.469055 | 902.047917 | 3.911779e+06 | 18.961511 | 0.093367 | 18887.576472 |
| min | 1985.000000 | 0.000000 | 2.780000e+02 | 0.000000 | 0.483000 | 251.000000 |
| 25% | 1995.000000 | 3.000000 | 9.749850e+04 | 0.920000 | 0.713000 | 3447.000000 |
| 50% | 2002.000000 | 25.000000 | 4.301500e+05 | 5.990000 | 0.779000 | 9372.000000 |
| 75% | 2008.000000 | 131.000000 | 1.486143e+06 | 16.620000 | 0.855000 | 24874.000000 |
| max | 2016.000000 | 22338.000000 | 4.380521e+07 | 224.970000 | 0.944000 | 126352.000000 |
# Check the column names
data.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
'suicides/100k pop', 'country-year', 'HDI for year',
' gdp_for_year ($) ', 'gdp_per_capita ($)', 'generation'],
dtype='object')
#Check the shape of the data
data.shape
(27820, 12)
# Check the data type of each column
data.dtypes
country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
gdp_for_year ($)       object
gdp_per_capita ($)      int64
generation             object
dtype: object
data.dtypes.value_counts()
object     6
int64      4
float64    2
dtype: int64
#check the dataset info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   country             27820 non-null  object
 1   year                27820 non-null  int64
 2   sex                 27820 non-null  object
 3   age                 27820 non-null  object
 4   suicides_no         27820 non-null  int64
 5   population          27820 non-null  int64
 6   suicides/100k pop   27820 non-null  float64
 7   country-year        27820 non-null  object
 8   HDI for year        8364 non-null   float64
 9   gdp_for_year ($)    27820 non-null  object
 10  gdp_per_capita ($)  27820 non-null  int64
 11  generation          27820 non-null  object
dtypes: float64(2), int64(4), object(6)
memory usage: 2.5+ MB
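The outputs above show two quirks of the GDP column: its name carries leading and trailing spaces (`' gdp_for_year ($) '`), and its dtype is `object` because the values are strings with thousands separators such as `2,156,624,900`. A minimal cleanup sketch follows; `toy` is a made-up frame that reproduces both quirks, and its second GDP value is invented for illustration.

```python
import pandas as pd

# Made-up rows that reproduce the two quirks seen in the real file
toy = pd.DataFrame({
    "country": ["Albania", "Albania"],
    " gdp_for_year ($) ": ["2,156,624,900", "2,126,000,000"],
})

# Remove leading/trailing spaces from every column name
toy.columns = toy.columns.str.strip()

# Drop the thousands separators, then convert the strings to integers
toy["gdp_for_year ($)"] = (
    toy["gdp_for_year ($)"].str.replace(",", "", regex=False).astype("int64")
)

print(toy.dtypes)
```

After this conversion the column would take part in `describe()` and in correlation tables. An alternative worth trying is passing `thousands=','` to `pd.read_csv` so the values are parsed as numbers on load.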
# Check for missing values
data.isnull().sum().sort_values(ascending = False)
HDI for year          19456
country                   0
year                      0
sex                       0
age                       0
suicides_no               0
population                0
suicides/100k pop         0
country-year              0
gdp_for_year ($)          0
gdp_per_capita ($)        0
generation                0
dtype: int64
data.shape[0]
27820
# Function that takes a DataFrame and returns the total and percentage of missing values per column
def missing_data(df):
    total = df.isnull().sum().sort_values(ascending = False)
    count = df.shape[0]  # number of rows
    percent = round(total / count, 2) * 100  # fraction rounded to 2 decimals, then scaled to percent
    value = pd.concat([total, percent], axis = 1, keys=['Total','Percent'])
    return value
result = missing_data(data)
result
| | Total | Percent |
|---|---|---|
| HDI for year | 19456 | 70.0 |
| country | 0 | 0.0 |
| year | 0 | 0.0 |
| sex | 0 | 0.0 |
| age | 0 | 0.0 |
| suicides_no | 0 | 0.0 |
| population | 0 | 0.0 |
| suicides/100k pop | 0 | 0.0 |
| country-year | 0 | 0.0 |
| gdp_for_year ($) | 0 | 0.0 |
| gdp_per_capita ($) | 0 | 0.0 |
| generation | 0 | 0.0 |
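Because `missing_data` rounds the fraction to two decimals before multiplying by 100, 19456 missing values out of 27820 rows show up as 70.0 rather than about 69.94. The variant below is a sketch that rounds after scaling; `missing_summary` is a hypothetical name and `toy` is made-up data.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    total = df.isnull().sum()
    # The mean of a boolean column is the fraction of True values
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"Total": total, "Percent": percent})
    return summary.sort_values("Total", ascending=False)


# Made-up frame: 3 of the 4 HDI values are missing
toy = pd.DataFrame({
    "year": [1987, 1988, 1989, 1990],
    "HDI for year": [np.nan, np.nan, np.nan, 0.61],
})
result = missing_summary(toy)
print(result)
```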
#descriptive stats of continuous columns
# ' gdp_for_year ($) ' is stored as text (object), so describe() leaves it out of the numeric summary
data[['suicides_no','population','suicides/100k pop','HDI for year',' gdp_for_year ($) ','gdp_per_capita ($)']].describe()
| | suicides_no | population | suicides/100k pop | HDI for year | gdp_per_capita ($) |
|---|---|---|---|---|---|
| count | 27820.000000 | 2.782000e+04 | 27820.000000 | 8364.000000 | 27820.000000 |
| mean | 242.574407 | 1.844794e+06 | 12.816097 | 0.776601 | 16866.464414 |
| std | 902.047917 | 3.911779e+06 | 18.961511 | 0.093367 | 18887.576472 |
| min | 0.000000 | 2.780000e+02 | 0.000000 | 0.483000 | 251.000000 |
| 25% | 3.000000 | 9.749850e+04 | 0.920000 | 0.713000 | 3447.000000 |
| 50% | 25.000000 | 4.301500e+05 | 5.990000 | 0.779000 | 9372.000000 |
| 75% | 131.000000 | 1.486143e+06 | 16.620000 | 0.855000 | 24874.000000 |
| max | 22338.000000 | 4.380521e+07 | 224.970000 | 0.944000 | 126352.000000 |
# Count the rows in each age group with a crosstab
pd.crosstab(index = data['age'], columns='count')
| col_0 | count |
|---|---|
| age | |
| 15-24 years | 4642 |
| 25-34 years | 4642 |
| 35-54 years | 4642 |
| 5-14 years | 4610 |
| 55-74 years | 4642 |
| 75+ years | 4642 |
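The crosstab above counts a single variable. For the categorical vs categorical scenario, `pd.crosstab` can tabulate two columns against each other, and `normalize="index"` turns the counts into row-wise proportions. The sketch below uses a made-up frame with the same `sex` and `age` labels as the real data.

```python
import pandas as pd

# Made-up rows; only the labels match the real data
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male", "female"],
    "age": ["15-24 years", "75+ years", "15-24 years",
            "15-24 years", "75+ years", "75+ years"],
})

# Raw counts for every (age, sex) combination
counts = pd.crosstab(index=toy["age"], columns=toy["sex"])

# Row-wise proportions: each age group sums to 1
shares = pd.crosstab(index=toy["age"], columns=toy["sex"], normalize="index")

print(counts)
print(shares)
```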
# Plot the 10 countries with the highest total number of suicides
data.groupby('country')['suicides_no'].sum().reset_index().sort_values('suicides_no',ascending = False).head(10).plot(x = 'country', y = 'suicides_no',kind='bar',figsize=(15,5))
<AxesSubplot:xlabel='country'>
# Plot the 10 countries with the lowest total number of suicides
data.groupby('country')['suicides_no'].sum().reset_index().sort_values(['suicides_no'],ascending = True).head(10).plot(x = 'country', y = 'suicides_no',kind='bar',figsize = (15,5))
<AxesSubplot:xlabel='country'>
plt.figure(figsize=(15,5))  # set the figure size to 15 x 5 inches
sns.barplot(x = 'age', y ='suicides_no',data=data)
<AxesSubplot:xlabel='age', ylabel='suicides_no'>
plt.figure(figsize = (8,4))
sns.barplot(x='sex', y= 'suicides_no',data = data)
<AxesSubplot:xlabel='sex', ylabel='suicides_no'>
plt.figure(figsize = (15,5))
sns.barplot(x = 'generation', y ='suicides_no', data = data)
<AxesSubplot:xlabel='generation', ylabel='suicides_no'>
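When reading the bar plots above, keep in mind that `sns.barplot` draws the mean of `y` for each category by default (with an error bar for its uncertainty), not the total. The sketch below shows the difference on made-up rows; only the column names match the real data.

```python
import pandas as pd

# Made-up rows; only the column names match the real data
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "suicides_no": [21, 16, 9, 14, 4],
})

# Mean is what barplot shows by default; sum is the total per category
per_sex = toy.groupby("sex")["suicides_no"].agg(["mean", "sum", "count"])
print(per_sex)
```

To plot totals instead of means, one option is to pass `estimator=sum` to `sns.barplot`, or to aggregate with `groupby` first and plot the result.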
plt.figure(figsize=(15,5))
sns.scatterplot(x = "population", y = 'suicides_no', data = data, hue ='sex')
<AxesSubplot:xlabel='population', ylabel='suicides_no'>
figure = plt.figure(figsize=(50,15))
ax = sns.regplot(x='population',y='suicides_no', data=data ) # regression plot - scatter plot with a regression line
# Here we draw a line plot of the first 10 rows.
sns.lineplot(x='population',y='suicides_no', data=data.head(10) )
<AxesSubplot:xlabel='population', ylabel='suicides_no'>
plt.figure(figsize = (15,5))
sns.scatterplot(x ='gdp_per_capita ($)', y ='suicides/100k pop',data = data )
<AxesSubplot:xlabel='gdp_per_capita ($)', ylabel='suicides/100k pop'>
plt.figure(figsize = (50,15))
sns.regplot(x ='gdp_per_capita ($)', y ='suicides/100k pop',data = data )
<AxesSubplot:xlabel='gdp_per_capita ($)', ylabel='suicides/100k pop'>
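By default, `sns.regplot` overlays a straight line fitted by linear regression on the scatter plot. To get the slope and intercept of such a line as numbers, a degree-1 `np.polyfit` is one option. The points below are made up and lie close to the line y = 2x + 1.

```python
import numpy as np

# Made-up points that lie close to the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)

y_hat = slope * x + intercept   # fitted values on the line
residuals = y - y_hat           # vertical distances from the points to the line

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A slope near zero, as the flat line in the GDP plot suggests, means the fitted line carries little information about `y`.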
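# Correlate only the numeric columns; pandas 2.0 and later raise an error when text columns are passed to corr()
data.select_dtypes(include='number').corr()
| | year | suicides_no | population | suicides/100k pop | HDI for year | gdp_per_capita ($) |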
|---|---|---|---|---|---|---|
| year | 1.000000 | -0.004546 | 0.008850 | -0.039037 | 0.366786 | 0.339134 |
| suicides_no | -0.004546 | 1.000000 | 0.616162 | 0.306604 | 0.151399 | 0.061330 |
| population | 0.008850 | 0.616162 | 1.000000 | 0.008285 | 0.102943 | 0.081510 |
| suicides/100k pop | -0.039037 | 0.306604 | 0.008285 | 1.000000 | 0.074279 | 0.001785 |
| HDI for year | 0.366786 | 0.151399 | 0.102943 | 0.074279 | 1.000000 | 0.771228 |
| gdp_per_capita ($) | 0.339134 | 0.061330 | 0.081510 | 0.001785 | 0.771228 | 1.000000 |
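Each cell in the table above is a Pearson correlation coefficient: the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand and compares it with `Series.corr`; `toy` contains only the five rows shown by `data.head()`, so the value will differ from the full-data figure.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head() above
toy = pd.DataFrame({
    "population": [312900, 308000, 289700, 21800, 274300],
    "suicides_no": [21, 16, 14, 1, 9],
})

x = toy["population"].to_numpy(dtype=float)
y = toy["suicides_no"].to_numpy(dtype=float)

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_pandas = toy["population"].corr(toy["suicides_no"])  # Pearson by default
print(round(r_manual, 4), round(r_pandas, 4))
```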
plt.figure(figsize = (10,10))
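# Use only the numeric columns and center the colormap at 0, the midpoint of the [-1, 1] correlation range
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True, fmt='.2f', linewidths=0.5, center = 0)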
<AxesSubplot:>
plt.figure(figsize = (15,5))
sns.barplot(x= 'sex',y = 'suicides_no', data = data,hue ='age')
<AxesSubplot:xlabel='sex', ylabel='suicides_no'>
plt.figure(figsize=(15,5))
sns.barplot(data=data,x='sex',y='suicides_no',hue='generation')
plt.show()
data.head()
data1 = data.groupby("country")['suicides_no'].sum().reset_index().sort_values("suicides_no",ascending = False)
plt.figure(figsize = (15,5))
sns.barplot(x= "country", y = "suicides_no",data=data1.head(10))
<AxesSubplot:xlabel='country', ylabel='suicides_no'>
# Total suicides by country and sex, sorted in descending order
data2 = data.groupby(["country","sex"])['suicides_no'].sum().reset_index().sort_values("suicides_no",ascending = False)
plt.figure(figsize = (15,10))
sns.barplot(x= "country", y = "suicides_no",data=data2.head(10),hue ='sex')
<AxesSubplot:xlabel='country', ylabel='suicides_no'>
plt.figure(figsize = (15,10))
sns.pointplot(x="generation", y="suicides_no", data=data)
<AxesSubplot:xlabel='generation', ylabel='suicides_no'>
plt.figure(figsize = (15,5))
sns.violinplot(x = 'generation', y = 'population', data = data)
<AxesSubplot:xlabel='generation', ylabel='population'>
plt.figure(figsize = (15,5))
sns.boxplot(x = 'generation', y = 'population', data = data)
<AxesSubplot:xlabel='generation', ylabel='population'>
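A box plot summarizes each group by its quartiles: the box spans Q1 to Q3, and with seaborn's default setting, points more than 1.5 times the interquartile range (IQR) beyond the box are drawn as individual outliers. The sketch below computes those numbers with `groupby`; the populations are made up, and only the upper fence is checked to keep it short.

```python
import pandas as pd

# Made-up populations for two generations
toy = pd.DataFrame({
    "generation": ["Boomers"] * 5 + ["Silent"] * 5,
    "population": [100, 120, 130, 150, 900, 80, 90, 100, 110, 120],
})

grouped = toy.groupby("generation")["population"]
q1 = grouped.quantile(0.25)
q3 = grouped.quantile(0.75)
iqr = q3 - q1

# Points above Q3 + 1.5 * IQR would be drawn as outliers on the upper side
upper_fence = q3 + 1.5 * iqr
is_outlier = toy["population"] > toy["generation"].map(upper_fence)

print(pd.DataFrame({"Q1": q1, "Q3": q3, "IQR": iqr, "upper_fence": upper_fence}))
print(toy[is_outlier])
```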
Checking the pattern with a trend plot (1985-2016): number of suicides vs. year
data[["suicides_no","year"]]
#data.columns
| | suicides_no | year |
|---|---|---|
| 0 | 21 | 1987 |
| 1 | 16 | 1987 |
| 2 | 14 | 1987 |
| 3 | 1 | 1987 |
| 4 | 9 | 1987 |
| ... | ... | ... |
| 27815 | 107 | 2014 |
| 27816 | 9 | 2014 |
| 27817 | 60 | 2014 |
| 27818 | 44 | 2014 |
| 27819 | 21 | 2014 |
27820 rows × 2 columns
year_sui = data[["suicides_no","year"]].groupby('year').sum()
year_sui.head()
| | suicides_no |
|---|---|
| year | |
| 1985 | 116063 |
| 1986 | 120670 |
| 1987 | 126842 |
| 1988 | 121026 |
| 1989 | 160244 |
year_sui.plot(figsize=(10,6))
<AxesSubplot:xlabel='year'>
year_pop = data[["year","population"]].groupby('year').sum()
year_pop.head()
| | population |
|---|---|
| year | |
| 1985 | 1008600086 |
| 1986 | 1029909613 |
| 1987 | 1095029726 |
| 1988 | 1054094424 |
| 1989 | 1225514347 |
year_pop.plot(figsize = (10,6))
<AxesSubplot:xlabel='year'>
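The total population covered by the data changes from year to year, so the raw yearly counts are hard to compare directly. One way to account for that is to divide the yearly suicides by the yearly population and scale to a rate per 100,000 people. The sketch below does this for the five years shown in the `year_sui` and `year_pop` outputs above; the column name `rate_per_100k` is an assumption made for this example.

```python
import pandas as pd

years = pd.Index([1985, 1986, 1987, 1988, 1989], name="year")

# Yearly totals copied from the year_sui and year_pop outputs above
year_sui = pd.DataFrame(
    {"suicides_no": [116063, 120670, 126842, 121026, 160244]}, index=years
)
year_pop = pd.DataFrame(
    {"population": [1008600086, 1029909613, 1095029726, 1054094424, 1225514347]},
    index=years,
)

# Join on the shared 'year' index, then scale to suicides per 100,000 people
yearly = year_sui.join(year_pop)
yearly["rate_per_100k"] = yearly["suicides_no"] / yearly["population"] * 100_000

print(yearly["rate_per_100k"].round(2))
```

With the full data, `yearly["rate_per_100k"].plot(figsize=(10,6))` would give a trend line that is comparable across years.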
pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
df = pd.read_csv('match_data.csv')
df.head()
| | id | season | city | date | team1 | team2 | toss_winner | toss_decision | winner | eliminator | ... | overs | player_of_match | venue | umpire1 | umpire2 | umpire3 | first_bat_team | first_bowl_team | first_bat_score | second_bat_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 419164 | 2010 | Mumbai | 2010-04-24 | Royal Challengers Bangalore | Deccan Chargers | Deccan Chargers | bat | Royal Challengers Bangalore | NaN | ... | 20 | A Kumble | Dr DY Patil Sports Academy | RE Koertzen | SJA Taufel | NaN | Deccan Chargers | Royal Challengers Bangalore | 82.0 | 86.0 |
| 1 | 419131 | 2010 | Delhi | 2010-03-29 | Delhi Daredevils | Kolkata Knight Riders | Delhi Daredevils | bat | Delhi Daredevils | NaN | ... | 20 | DA Warner | Feroz Shah Kotla | SS Hazare | SJA Taufel | NaN | Delhi Daredevils | Kolkata Knight Riders | 177.0 | 137.0 |
| 2 | 336021 | 2008 | Mumbai | 2008-05-16 | Mumbai Indians | Kolkata Knight Riders | Mumbai Indians | field | Mumbai Indians | NaN | ... | 20 | SM Pollock | Wankhede Stadium | BR Doctrove | DJ Harper | NaN | Kolkata Knight Riders | Mumbai Indians | 67.0 | 68.0 |
| 3 | 980931 | 2016 | Pune | 2016-04-22 | Rising Pune Supergiants | Royal Challengers Bangalore | Rising Pune Supergiants | field | Royal Challengers Bangalore | NaN | ... | 20 | AB de Villiers | Maharashtra Cricket Association Stadium | CB Gaffaney | VK Sharma | NaN | Royal Challengers Bangalore | Rising Pune Supergiants | 185.0 | 172.0 |
| 4 | 419163 | 2010 | Mumbai | 2010-04-22 | Chennai Super Kings | Deccan Chargers | Chennai Super Kings | bat | Chennai Super Kings | NaN | ... | 20 | DE Bollinger | Dr DY Patil Sports Academy | BR Doctrove | RB Tiffin | NaN | Chennai Super Kings | Deccan Chargers | 142.0 | 104.0 |
5 rows × 24 columns
It has many useful features, but the most useful one is generating a full EDA report, as shown below. Note that recent releases of the package are published under the name ydata-profiling.
#Installation step
#!pip install pandas-profiling
#or
import sys
!{sys.executable} -m pip install pandas-profiling
Requirement already satisfied: pandas-profiling in c:\users\chonc\anaconda3\envs\panda_playground\lib\site-packages (3.1.0)
# Import pandas_profiling
import pandas_profiling
# Build the profiling report once; as the last expression of a notebook cell it is rendered inline
profile = pandas_profiling.ProfileReport(df)
profile
# Save the same report as an HTML file
profile.to_file("pandas_profiling.html")
##### Thanks for visiting.